Introduction

Row

Overview

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC – THE DATA MODELING FRAMEWORK

  • DEFINE the Problem
  • COLLECT the Data from Appropriate Sources
  • ORGANIZE the Data Collected
  • VISUALIZE the Data by Developing Charts
  • ANALYZE the data with Appropriate Statistical Methods
  • COMMUNICATE your Results

Row

The Problem & Data Collection

The Problem

This dataset provides insights into energy efficiency meter evaluated projects for residential existing homes in New York state from 2007-2012. We will examine the variables in the dataset to determine what helps to predict the changes in energy efficiency in these homes.

The Data

This dataset has 2696 rows and 20 variables.

Data Sources

The Devastator. (n.d.). Residential Home Energy Efficiency [Dataset].

The Data

VARIABLES TO PREDICT WITH

  • home_size: The size of the home in which the project was completed.
  • home_vol: The volume of the home in which the project was completed.
  • num_units: The number of units in the home in which the project was completed.
  • home_year: The year the home in which the project was completed was built.
  • total_cost: The total cost of the project.
  • contractor_incentive: The incentive given to the contractor for completing the project.
  • total_incentive: The total incentives given for the project.
  • elec_base_kwh: The baseline electric usage in kWh.
  • gas_base_mmbtu: The baseline gas usage in MMBtu.
  • elec_report: The reported electric usage in kWh.
  • gas_report: The reported gas usage in MMBtu.
  • elec_eval_save_kwh: The evaluated electric usage in kWh.
  • gas_eval_save_mmbtu: The evaluated gas usage in MMBtu.
  • National_Grid: Whether the customer subscribes to National Grid.
  • National_Fuel_Gas: Whether the customer subscribes to National Fuel and Gas.
  • NYSEG: Whether the customer subscribes to NYSEG.
  • Orange_Rockland: Whether the customer subscribes to Orange & Rockland.
  • Rochester_Gas_Electric: Whether the customer subscribes to Rochester Gas & Electric.

VARIABLES WE WANT TO PREDICT

  • est_y1_save: The estimated first year energy bill savings in dollars.
  • cust_type: The type of customer the project was completed for, market(1) or assisted(0).

Data

Column

Organize the Data

Organizing data can also include summarizing data values in simple one-way and two-way tables.

   cust_type        home_size        home_vol       num_units    
 Min.   :0.0000   Min.   :  520   Min.   : 4160   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.: 1352   1st Qu.:10752   1st Qu.:1.000  
 Median :1.0000   Median : 1788   Median :14208   Median :1.000  
 Mean   :0.7352   Mean   : 1946   Mean   :15569   Mean   :1.034  
 3rd Qu.:1.0000   3rd Qu.: 2308   3rd Qu.:18432   3rd Qu.:1.000  
 Max.   :1.0000   Max.   :10260   Max.   :82080   Max.   :4.000  
                                                  NA's   :131    
   home_year      total_cost    contractor_incentive total_incentive  
 Min.   :1756   Min.   :  500   Min.   :   0.0       Min.   :    0.0  
 1st Qu.:1940   1st Qu.: 4664   1st Qu.: 224.5       1st Qu.:  422.5  
 Median :1955   Median : 7200   Median : 362.2       Median :  800.0  
 Mean   :1951   Mean   : 8223   Mean   : 360.9       Mean   : 1565.5  
 3rd Qu.:1970   3rd Qu.:10471   3rd Qu.: 500.0       3rd Qu.: 2010.8  
 Max.   :2010   Max.   :44245   Max.   :1050.0       Max.   :10000.0  
 NA's   :33                                                           
  est_y1_save      elec_base_kwh   gas_base_mmbtu    elec_report   
 Min.   :   1.19   Min.   :    0   Min.   :  0.00   Min.   :    0  
 1st Qu.: 223.75   1st Qu.: 2393   1st Qu.:  0.00   1st Qu.: 2490  
 Median : 501.18   Median : 6280   Median : 58.72   Median : 6082  
 Mean   : 653.42   Mean   : 6597   Mean   : 58.19   Mean   : 6387  
 3rd Qu.: 906.30   3rd Qu.: 9632   3rd Qu.:104.48   3rd Qu.: 9315  
 Max.   :5919.87   Max.   :45372   Max.   :398.63   Max.   :42521  
                                                                   
   gas_report     elec_eval_save_kwh gas_eval_save_mmbtu National_Grid   
 Min.   :  0.00   Min.   :-11584.5   Min.   :-180.170    Min.   :0.0000  
 1st Qu.:  0.00   1st Qu.:  -294.8   1st Qu.:   0.000    1st Qu.:1.0000  
 Median : 50.73   Median :     0.0   Median :   0.000    Median :1.0000  
 Mean   : 50.12   Mean   :   210.1   Mean   :   8.065    Mean   :0.7622  
 3rd Qu.: 89.08   3rd Qu.:   689.3   3rd Qu.:  14.627    3rd Qu.:1.0000  
 Max.   :362.63   Max.   : 16589.6   Max.   : 184.510    Max.   :1.0000  
                                                                         
 National_Fuel_Gas     NYSEG         Orange_Rockland Rochester_Gas_Electric
 Min.   :0.0000    Min.   :0.00000   Min.   :0.000   Min.   :0.00000       
 1st Qu.:0.0000    1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.00000       
 Median :0.0000    Median :0.00000   Median :0.000   Median :0.00000       
 Mean   :0.1725    Mean   :0.03264   Mean   :0.023   Mean   :0.01595       
 3rd Qu.:0.0000    3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:0.00000       
 Max.   :1.0000    Max.   :1.00000   Max.   :1.000   Max.   :1.00000       
                                                                           

All variables that are not pertinent to the dependent variables, such as zip code, locations, and indexes, have been removed. We can see that some of the variables have many null values.

Column

Transform Variables

cust_type is a 1 if they are market, and zero if assisted. This will be transformed into a categorical variable. Each of the different providers will be converted into true = 1, false = 0.

cust_type (Market or Assited?)

# A tibble: 2 × 2
  cust_type     n
  <chr>     <int>
1 0           714
2 1          1982

cust_type (Market or Assisted)

Data Vizualization #1

Column

Response Variables

cust_type Market(1)/Assisted(0)

We can see we have about 73% of the variable as market customers. Looking at the potential predictors of cust_type, we predict the strongest relationships between total_incentive and home_size.

Column

Transform Variables

Data Vizualization #2

Column

Response Variables

est_y1_save

We see the largest concentration of values around $0 - $1000. We see the two strongest relationships between home_year and num_units. The data is skewed right.

Column

Transform Variables

est_y1_save Analysis

Row

Predict Median Value

For this analysis we will use a Linear Regression Model.

Adjusted R-Squared

38 %

RMSE

443.89

Row

Regression Output

Estimate Std. Error t value Pr(>|t|)
(Intercept) 9401.866 586.106 16.041 0.000
home_year -4.677 0.297 -15.731 0.000
total_incentive 0.083 0.006 13.826 0.000
total_cost 0.024 0.003 8.649 0.000
Orange_Rockland1 -520.054 79.850 -6.513 0.000
National_Fuel_Gas1 -336.519 58.227 -5.779 0.000
Rochester_Gas_Electric1 -494.876 85.701 -5.774 0.000
National_Grid1 -293.880 54.024 -5.440 0.000
num_units 166.685 43.937 3.794 0.000
contractor_incentive 0.214 0.072 2.969 0.003
elec_eval_save_kwh 3389.323 1930.772 1.755 0.079
elec_base_kwh -3389.288 1930.773 -1.755 0.079
elec_report 3389.286 1930.773 1.755 0.079
home_size 0.036 0.050 0.718 0.473
gas_report -1464.777 2418.132 -0.606 0.545
gas_base_mmbtu 1463.823 2418.131 0.605 0.545
gas_eval_save_mmbtu -1462.800 2418.127 -0.605 0.545
home_vol 0.003 0.006 0.551 0.582
NYSEG1 -18.943 52.700 -0.359 0.719

Residual Assumptions Explorations

Row

Analysis Summary

After examining this model, we determine that there are some predictors that are not important in prediciting the estimated savings, so a pruned version of the model is created by removing predictors that are not significant.

Row

Predict Median Value Final Version

For this analysis we will use a pruned Linear Regression Model. We removed elec_eval_save_kwh, elec_base_kwh, elec_report, home_size, gas_report, gas_base_mmbtu,gas_eval_save_mmbtu, home_vol, and NYSEG.

Adjusted R-Squared

36 %

RMSE

449.41

Row

Regression Output

Estimate Std. Error t value Pr(>|t|)
(Intercept) 9500.708 587.735 16.165 0.000
home_year -4.597 0.299 -15.388 0.000
total_cost 0.031 0.003 10.308 0.000
National_Fuel_Gas1 -399.360 56.437 -7.076 0.000
Orange_Rockland1 -548.489 78.624 -6.976 0.000
Rochester_Gas_Electric1 -493.196 86.173 -5.723 0.000
National_Grid1 -301.768 53.853 -5.604 0.000
cust_type1 -237.908 47.651 -4.993 0.000
num_units 201.917 44.193 4.569 0.000
contractor_incentive 0.260 0.073 3.591 0.000
total_incentive 0.021 0.013 1.577 0.115

Residual Assumptions Explorations

Row

Analysis Summary

After examining this model, looking at the residual plots we can see that there are some issues with our data. The high values at the right of the Q-Q plot could demonstrate some non-linearity or outliers in the data. We can see the same three points in our residuals vs fitted chart. We could transform the data or try some additional models to try to improve this fit.

Reducing the predictors that did not help with prediction of the estimated savings lowered our r^2 and increased our RMSE.

From the following table, we can see the effect on Median Value by the predictor variables.
Variable Direction
home_year Decrease
total_cost Increase
National_Fuel_Gas1 Decrease
Orange_Rockland1 Decrease
Rochester_Gas_Electric1 Decrease
National_Grid1 Decrease
cust_type1 Decrease
num_units Increase
contractor_incentive Increase
total_incentive Increase

cust_type Analysis

Row

Predict Customer Type

Row ———————————————————————–

Analysis Summary

Examining these models, the Neural Network model performed the most accurately to predict the validation data. The decision tree also performed well in its prediction of the validation data, and we can easily tell how it was able to come to these predictions. It seems that total incentive and total cost were the two largest determining factors of whether the customer was a market customer or an assisted customer.

Conclusion

Summary

In conclusion, we see that our predictors can accurately predict what type of customer a person is, but is less successful at accurately predicting what the estimated cost for renovations would be.

---
title: "Heating System Renovation"
author: "Aiden Belt"
date: "2023-06-04"
output:
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
library(readxl)
```
```{r load_data}
df <- read_excel("ProjectData_belt.xlsx", sheet = "Clean Data")
```

Introduction {data-orientation=rows}
=======================================================================

Row {data-height=250}
-----------------------------------------------------------------------

### Overview 

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC – THE DATA MODELING FRAMEWORK

* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* COMMUNICATE your Results

Row {data-height=650}
-----------------------------------------------------------------------

### The Problem & Data Collection

#### The Problem
This dataset provides insights into energy efficiency meter evaluated projects for residential existing homes in New York state from 2007-2012. We will examine the variables in the dataset to determine what helps to predict the changes in energy efficiency in these homes.

#### The Data
This dataset has 2696 rows and 20 variables.

#### Data Sources
The Devastator. (n.d.). Residential Home Energy Efficiency [Dataset].


### The Data
VARIABLES TO PREDICT WITH

* *home_size*: The size of the home in which the project was completed.
* *home_vol*: The volume of the home in which the project was completed. 
* *num_units*: The number of units in the home in which the project was completed.
* *home_year*: The year the home in which the project was completed was built.
* *total_cost*: The total cost of the project.
* *contractor_incentive*: The incentive given to the contractor for completing the project.
* *total_incentive*: The total incentives given for the project. 
* *elec_base_kwh*: The baseline electric usage in kWh.
* *gas_base_mmbtu*: The baseline gas usage in MMBtu.
* *elec_report*: The reported electric usage in kWh. 
* *gas_report*: The reported gas usage in MMBtu.
* *elec_eval_save_kwh*: The evaluated electric usage in kWh. 
* *gas_eval_save_mmbtu*: The evaluated gas usage in MMBtu. 
* *National_Grid*: Whether the customer subscribes to National Grid. 
* *National_Fuel_Gas*: Whether the customer subscribes to National Fuel and Gas.
* *NYSEG*: Whether the customer subscribes to NYSEG.
* *Orange_Rockland*: Whether the customer subscribes to Orange & Rockland.
* *Rochester_Gas_Electric*: Whether the customer subscribes to Rochester Gas & Electric.

VARIABLES WE WANT TO PREDICT

* *est_y1_save*: The estimated first year energy bill savings in dollars.
* *cust_type*: The type of customer the project was completed for, market(1) or assisted(0). 


Data {data-orientation=rows}
=======================================================================

Column
-----------------------------------------------------------------------

### Organize the Data
Organizing data can also include summarizing data values in simple one-way and two-way tables.

```{r, cache=TRUE}
summary(df)
```
All variables that are not pertinent to the dependent variables, such as zip code, locations, and indexes, have been removed. We can see that some of the variables have many null values.

Column
-----------------------------------------------------------------------
### Transform Variables
cust_type is a 1 if they are market, and zero if assisted. This will be transformed into a categorical variable. Each of the different providers will be converted into true = 1, false = 0. 

```{r, cache=TRUE}
df <- mutate(df,cust_type=as.factor(cust_type),
             National_Grid=as.factor(National_Grid),
             National_Fuel_Gas=as.factor(National_Fuel_Gas),
             NYSEG=as.factor(NYSEG),
             Orange_Rockland=as.factor(Orange_Rockland),
             Rochester_Gas_Electric=as.factor(Rochester_Gas_Electric)
             )
```
#### cust_type (Market or Assited?)
```{r, cache=TRUE}
as_tibble (select(df,cust_type) %>%
  table())
```
#### cust_type (Market or Assisted)

![](DIST_cust_type.png)



Data Vizualization #1
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### cust_type Market(1)/Assisted(0)

```{r, cache=TRUE}
as_tibble(select(df,cust_type) %>%
         table()) %>% #rename(CAT.MEDV=".") %>% 
  ggplot(aes(y=n,x=cust_type)) + geom_bar(stat="identity")
```
We can see we have about 73% of the variable as market customers. Looking at the potential predictors of cust_type, we predict the strongest relationships between total_incentive and home_size.

Column {data-width=500}
-----------------------------------------------------------------------

### Transform Variables
```{r, cache=TRUE}
ggpairs(select(df, cust_type, home_year, contractor_incentive, home_size, home_vol, num_units, total_incentive))
```

Data Vizualization #2
=======================================================================

Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables

#### est_y1_save
```{r, cache=TRUE}
ggplot(df, aes(est_y1_save)) + geom_histogram(bins=20)
```
We see the largest concentration of values around $0 - $1000. We see the two strongest relationships between home_year and num_units. The data is skewed right. 

Column {data-width=500}
-----------------------------------------------------------------------

### Transform Variables
```{r, cache=TRUE}
ggpairs(select(df, est_y1_save, home_size, home_vol, num_units, home_year, elec_base_kwh, gas_base_mmbtu))
```


est_y1_save Analysis {data-orientation=rows}
=======================================================================

Row
-----------------------------------------------------------------------

### Predict Median Value
For this analysis we will use a Linear Regression Model.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
est_y1_save_lm <- lm(est_y1_save ~ . -cust_type,data = df)
summary(est_y1_save_lm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(est_y1_save_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(est_y1_save_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-down")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(est_y1_save_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-down")
```
Row
-----------------------------------------------------------------------

### Regression Output
```{r,include=FALSE, cache=TRUE}
knitr::kable(summary(est_y1_save_lm)$coef, digits = 3) #pretty table output
summary(est_y1_save_lm)$coef
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(est_y1_save_lm))[,4])  
out <- coef(summary(est_y1_save_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(est_y1_save_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

Row
-----------------------------------------------------------------------

### Analysis Summary
After examining this model, we determine that there are some predictors that are not important in prediciting the estimated savings, so a pruned version of the model is created by removing predictors that are not significant.

Row
-----------------------------------------------------------------------

### Predict Median Value Final Version
For this analysis we will use a pruned Linear Regression Model. We removed elec_eval_save_kwh, elec_base_kwh, elec_report, home_size, gas_report, gas_base_mmbtu,gas_eval_save_mmbtu, home_vol, and NYSEG.  
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
est_y1_save_lm <- lm(est_y1_save ~ . -elec_eval_save_kwh -elec_base_kwh -elec_report -home_size -gas_report -gas_base_mmbtu -gas_eval_save_mmbtu -home_vol -NYSEG ,data = df)
summary(est_y1_save_lm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(est_y1_save_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(est_y1_save_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-down")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(est_y1_save_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-down")
```
Row
-----------------------------------------------------------------------

### Regression Output
```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(est_y1_save_lm)$coef, digits = 3) #pretty table output
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(est_y1_save_lm))[,4])  
out <- coef(summary(est_y1_save_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(est_y1_save_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

Row
-----------------------------------------------------------------------

### Analysis Summary

After examining this model, looking at the residual plots we can see that there are some issues with our data. The high values at the right of the Q-Q plot could demonstrate some non-linearity or outliers in the data. We can see the same three points in our residuals vs fitted chart. We could transform the data or try some additional models to try to improve this fit.

Reducing the predictors that did not help with prediction of the estimated savings lowered our r^2 and increased our RMSE.

From the following table, we can see the effect on Median Value by the predictor variables.
```{r, cache=TRUE}
#create table summary of predictor changes
predchang = data_frame(
  Variable = c('home_year', 'total_cost','National_Fuel_Gas1','Orange_Rockland1','Rochester_Gas_Electric1','National_Grid1','cust_type1', 'num_units', 'contractor_incentive', 'total_incentive'),
  Direction = c('Decrease','Increase','Decrease', 'Decrease','Decrease','Decrease', 'Decrease', 'Increase', 'Increase', 'Increase')
)
knitr::kable(predchang) #pretty table output
```
cust_type Analysis {data-orientation=rows}
=======================================================================

Row {data-height=900}
-----------------------------------------------------------------------

### Predict Customer Type
![](cust_type_models.png)
Row
-----------------------------------------------------------------------

### Analysis Summary
Examining these models, the Neural Network model performed the most accurately to predict the validation data. The decision tree also performed well in its prediction of the validation data, and we can easily tell how it was able to come to these predictions. It seems that total incentive and total cost were the two largest determining factors of whether the customer was a market customer or an assisted customer.  

Conclusion
=======================================================================
### Summary

In conclusion, we see that our predictors can accurately predict what type of customer a person is, but is less successful at accurately predicting what the estimated cost for renovations would be.